Elden Ring Geoguesser

Elden Ring's world is distinct in a way most games aren't. Each area has such a strong identity that most players can immediately tell if they're in Caelid with its rot-red sky or the foggy Lakes of Liurnia. A model that takes a screenshot as input and outputs coordinates, could that work too?

An Elden Ring landscape screenshot — Where could this be?

Gathering Data

First, lots of data.

One option would be to automatically take screenshots (good) and label them by hand (bad). Labeling by hand is tedious and scales terribly. Since we're in a game world, it should be possible to extract that information from the game itself. Enter Cheat Engine, a program that reads data directly from memory. It's often used for cheating, but here it's a great tool to read Elden Ring's memory and pull out the coordinates we need. To find the right memory addresses we use the Elden Ring Cheat Engine table maintained by The Grand Archives.

Some Lua scripting later and we have a tool that extracts player coordinates every second.


    IDS = {
      anim   = 100970,  -- Current Animation
      played = 100971,  -- Length played [seconds]
      cx     = 107824,  -- Camera X
      cy     = 107825,  -- Camera Y
      cz     = 107826,  -- Camera Z
      map    = 107917,  -- Map ID (hex)
      px_l   = 107924,  -- Local X
      pz_l   = 107925,  -- Local Z
      py_l   = 107926,  -- Local Y
      px_g   = 107928,  -- Semi-Global X
      pz_g   = 107929,  -- Semi-Global Z
      py_g   = 107930,  -- Semi-Global Y
    }
    
    local function getVal(id) -- Function to extract data
      local rec = AddressList.getMemoryRecordByID(id)
      if rec == nil then return nil end
      return rec.Value
    end

    looping, parsing and saving...

Combined with a simple Python screenshotting tool we get:

Map of the Lands Between with red dots marking screenshot positions along a walking path — Each red dot is a position with a screenshot

More data would be better, but it's enough for a decent prototype. On to modelling.

(There's also a bunch of coordinate math involved to convert local chunk coordinates into global ones. Check out this image by FloorBelow and soulsmodding.wikidot.com for details.)

Building a Baseline Model

The model we use is DINOv2, a vision model trained to produce useful image representations without needing labeled data. Rather than training it to classify images, we let it memorize a set of screenshots by converting them into embeddings, compact numerical descriptions of what's in each image. When we show it an unknown screenshot, it finds the most similar ones from its memory and returns their coordinates as a guess.

Map showing a red dot for the model's guess and a green dot for the actual location — The red dot is the guess, the green dot the actual location.

Not so promising. The input screenshot is on the left, the 5 most similar matches on the right.

Input screenshot alongside 5 similar screenshots retrieved by the model

The median error is 309 world units, roughly the distance from the Church of Elleh to the Gatefront Ruins. The mean at 687 is basically twice that.

Map showing the distance between Church of Elleh and Gatefront Ruins

It kind of works, but has two main problems. If an image from the north and one from the south both look similar, the model averages them and lands in the sea. It also focuses on the wrong things, detecting whether the player is on a horse rather than paying attention to background details. DINOv2 was trained on generic images, not Elden Ring screenshots, so this makes sense. Time to fix that.

Fine Tuning

The base DINOv2 model had no reason to care about location. It might decide two screenshots look similar because both have the player on horseback, ignoring the background entirely. Contrastive fine-tuning teaches it a simple lesson: screenshots taken nearby should feel similar, screenshots taken far apart should feel different. After enough of those comparisons, the model stops noticing the horse and starts noticing that the ruins in the background are distinctly Limgrave.

The error dropped significantly. Median went from 308.9 down to 29.5, and mean from 687.1 to 245.8. We went from a median error of "Church of Elleh to the Gatefront Ruins" to roughly a tenth of that.

Life Test

That's the next and final step..